\documentclass[12pt,]{article}
\usepackage{lmodern}
\usepackage{amssymb,amsmath}
\usepackage{ifxetex,ifluatex}
\usepackage{fixltx2e} % provides \textsubscript
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
  \usepackage[T1]{fontenc}
  \usepackage[utf8]{inputenc}
\else % if luatex or xelatex
  \ifxetex
    \usepackage{mathspec}
  \else
    \usepackage{fontspec}
  \fi
  \defaultfontfeatures{Ligatures=TeX,Scale=MatchLowercase}
\fi
% use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
% use microtype if available
\IfFileExists{microtype.sty}{%
\usepackage{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
\usepackage[margin=1in]{geometry}
\usepackage{hyperref}
\hypersetup{unicode=true,
            pdftitle={Achieving Statistical Significance with Control Variables and without Transparency},
            pdfauthor={Gabriel S. Lenz; Alexander Sahn},
            pdfborder={0 0 0},
            breaklinks=true}
\urlstyle{same}  % don't use monospace font for urls
\usepackage{natbib}
\bibliographystyle{apsr}
\usepackage{graphicx,grffile}
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
\IfFileExists{parskip.sty}{%
\usepackage{parskip}
}{% else
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}
}
\setlength{\emergencystretch}{3em}  % prevent overfull lines
\providecommand{\tightlist}{%
  \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\setcounter{secnumdepth}{5}
% Redefines (sub)paragraphs to behave more like sections
\ifx\paragraph\undefined\else
\let\oldparagraph\paragraph
\renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}}
\fi
\ifx\subparagraph\undefined\else
\let\oldsubparagraph\subparagraph
\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
\fi

%%% Use protect on footnotes to avoid problems with footnotes in titles
\let\rmarkdownfootnote\footnote%
\def\footnote{\protect\rmarkdownfootnote}

%%% Change title format to be more compact
\usepackage{titling}

% Create subtitle command for use in maketitle
\providecommand{\subtitle}[1]{
  \posttitle{
    \begin{center}\large#1\end{center}
    }
}

\setlength{\droptitle}{-2em}

  \title{Achieving Statistical Significance with Control Variables and without
Transparency}
    \pretitle{\vspace{\droptitle}\centering\huge}
  \posttitle{\par}
    \author{Gabriel S. Lenz\footnote{\href{mailto:glenz@berkeley.edu}{\nolinkurl{glenz@berkeley.edu}}.
  Travers Department of Political Science, University of California,
  Berkeley. 210 Barrows Hall, Berkeley, CA 94720-1950} \\ Alexander Sahn\footnote{\href{mailto:asahn@berkeley.edu}{\nolinkurl{asahn@berkeley.edu}}.
  Travers Department of Political Science, University of California,
  Berkeley. 210 Barrows Hall, Berkeley, CA 94720-1950}}
    \preauthor{\centering\large\emph}
  \postauthor{\par}
      \predate{\centering\large\emph}
  \postdate{\par}
    \date{May 27, 2020}

\renewcommand{\familydefault}{\sfdefault}
\usepackage{hyperref}

\begin{document}
\maketitle
\begin{abstract}
How often do articles depend on suppression effects for their findings?
How often do they disclose this fact? By suppression effects, we mean
control-variable-induced increases in estimated effect sizes.
Researchers generally scrutinize suppression effects as they want
reassurance that researchers have a strong explanation for them,
especially when the statistical significance of the key finding depends
on them. In a re-analysis of observational studies from a leading
journal, we find that over 30\% of articles depend on suppression
effects for statistical significance. Although increases in key effect
estimates from including control variables are of course potentially
justifiable, none of the articles justify or disclose them. These
findings may point to a hole in the review process: journals are
accepting articles that depend on suppression effects without readers,
reviewers, or editors being made aware.
\end{abstract}

\newpage

Imagine you are reading a study that reports a statistically significant
finding and includes several control variables in the regression model.
You learn that this finding only emerges with the addition of these
control variables and that adding them increases the effect size
estimate, shifting it from nonsignificance to significance. After
learning this fact, you will want reassurance that researchers have a
strong explanation for the control-variable-induced increase in
estimated effect size, sometimes called a suppression effect. You will
especially want reassurance when the statistical significance of the key
finding depends on this suppression effect. To quote an introductory
statistics textbook: ``In fact, it is pretty much standard operating
procedure that when suppressors arise most researchers dismiss the
finding as a statistical artifact unless there is a very strong
theoretical explanation for the result''
\citep[254]{bobko_correlation_2001}. Likewise, Andrew Gelman and his
co-authors write: ``Suffice it to say that, generally, suppression
effects are considered statistical artifacts unless there is a strong
theoretical explanation for their occurrence''
\citep{crede_questionable_2016}.

In this paper, we investigate how often studies depend on suppression
effects. While many definitions exist, we define suppression effects as
increases in the key coefficient estimate that result from adding
control variables. This definition corresponds with Conger's
\citeyearpar[36-37]{conger_revised_1974} expansive and generally
accepted definition of a suppressor variable: ``a variable which
increases the predictive validity of another variable (or set of
variables) by its inclusion in a regression equation,'' where predictive
validity is assessed by the magnitude of the regression coefficient
\citep{tzelgov_suppression_1991, mackinnon_equivalence_2000}. This
definition includes reciprocal suppression or cooperative suppression,
which is a term some researchers use for the most familiar form of
suppression, where control variables increase the magnitude of the key
effect but do not change the sign. It also includes the somewhat less
common case where the key coefficient switches signs, sometimes called
negative or net suppression.\footnote{We acknowledge that negative
  suppression may not always fall under Conger's definition, such as
  when a coefficient estimate changes signs but is smaller in absolute
  magnitude. We continue to call such cases suppression.} Finally, it
includes the rare classical suppression, where the suppressor variable
is nearly uncorrelated with the dependent variable
\citep{horst_role_1941}. Several sources formally define and discuss
types of suppression
\citep{nickerson_simpson_2019, conger_revised_1974, lewis_suppression_1986, tzelgov_suppression_1991}.

Although suppression effects should be treated with scrutiny, they can
be justified when researchers have a strong explanation for their
existence, that is, a strong explanation for why their key effect
estimate is suppressed. Tesler and Sears \citeyearpar[chapter
6]{tesler_obama_2010} provide an example in the 2008 US presidential
election. To their surprise, they found that white Democrats with
``gender traditional'' attitudes tended to choose Hillary Clinton over
Barack Obama in the Democratic primary, even though Hillary Clinton was
a feminist icon. When they controlled for racial resentment, however,
the relationship between gender traditionalism and vote choice switched:
now the gender traditionalists were less likely to support Clinton. Why?
They show that gender traditionalism and racial resentment correlated
and that racial resentment much more strongly predicted vote choice.
Racial resentment therefore suppressed the positive association between
gender traditionalism and opposing Clinton. When controlled for, the
expected relationship emerged. In this case, the authors provided a
plausible explanation for suppression.

Even though suppression effects can be justified, they deserve greater
scrutiny for several reasons. The first is robustness. When researchers
have a strong bivariate relationship, one that holds up with every
control they can throw at it, they feel confident about it---it seems
robust. When a finding depends on suppression effects, however, it is
not robust to alternative model specifications---it is by definition
acutely sensitive to the specification, as in the Tesler and Sears
example. When readers are evaluating statistical findings, they
generally want to know which of these two worlds they're in: generally
robust to model specification or generally not. If the latter, they
require much greater confidence in the model specification that
generates the findings.

A second reason for skepticism of suppression effects is that they can
introduce bias that favors the authors in a nontransparent way. Consider
an example of a data generating process with one key variable and 10
control variables, the first five of which increase the key effect
estimate, and the remaining five decreases the key effect estimate. As a
researcher adds these controls sequentially to a model, the first five
will increasingly bias their key effect estimate upwards, as these are
omitted suppressors. They induce bias because they unmask bias from the
still omitted five control variables. Only when the researcher adds all
10 will she estimate an unbiased effect. In this example, a researcher
could include the first five control variables, but leave out some of
the second five, biasing the effect estimate upwards. Since readers
never know the ``true'' data generating process, they may not realize
that control variables have been excluded. Researchers have strong
incentives to publish statistically significant findings and, given the
numerous forking paths they face when making research decisions about
control variables, they may have many opportunities to intentionally or
unintentionally take advantage of suppression effects. While all control
variables deserve scrutiny, suppression effects can induce bias that
favors the authors, and do so in a way that may be hidden from readers,
and so arguably deserve greater scrutiny. In contrast, when control
variables reduce estimated effect sizes, they work against authors'
publishing incentives and so may require less scrutiny.\footnote{We
  quote additional authors on why researchers should scrutinize
  suppression effects in the supporting information. The idea that
  controls can introduce biases is unintuitive to some readers. This
  hypothetical example illustrates that they can when some but not all
  controls are included. Numerous articles have discussed bias
  introduced by controls
  \citep{bhattacharya_instrumental_2012, clarke_phantom_2005, cole_illustrating_2010, middleton_bias_2016, pearl_class_2010, pearl_invited_2011, pearl_causality_2009, wooldridge_should_2016, wyss_reducing_2014}.}

A third reason for skepticism is that control variables can introduce
bias through opaque mechanisms. The hypothetical example above
illustrates the potential for bias that favors authors from confounding.
But control variables can also introduce bias that favors authors
through amplification bias or through mediation, to name just two
\citep{middleton_bias_2016, pearl_class_2010, mackinnon_equivalence_2000}.
Given the complexity of some of these effects, readers need to know if
findings could potentially depend on them.

As a reader, you therefore want to be alerted to suppression effects,
especially when the statistical significance of the key finding depends
on them. As a field, we want to know how often leading journals are
publishing such articles and especially how often they are doing so
without alerting readers, reviewers, and editors.

So, how often do articles depend on undisclosed suppression effects for
statistical significance? In this paper, we investigate this question by
analyzing replication data from a leading journal. We examine whether
findings depend on suppression effects by asking whether the main
estimate presented in the articles, which includes control variables, is
larger an absolute value than a bivariate estimate, which excludes those
controls. Specifically, we look for how often statistical significance
of the articles key result depends on the increase in estimated effect
size induced by control variables, that is, suppression effects. We find
a startling result: over 30\% of observational studies depend on
suppression effects for the statistical significance of their findings.
Moreover, we find that none of these studies disclose this fact.

These findings may point to a gap in the review process: journals
publish articles that depend on suppression effects for statistical
significance without the awareness of readers, reviewers, or editors.
They also reinforce concerns about the potential for researcher
discretion with control variables, concerns that have existed for
decades \citep{leamer_let_1983}. Replication efforts and meta-analyses
suggest that a reasonably large fraction of studies are false positives
or report much larger effects than actually exist
\citep{ioannidis_why_2005, ioannidis_power_2017, klein_investigating_2014, klein_many_2018}.
They have also pointed to suspicious patterns in test statistics
\citep{gerber_publication_2010, gerber_statistical_2008, brodeur_star_2016}.
Undisclosed suppression effects may be one source of these patterns.
Many solutions exist for this problem, some of which we discuss in our
conclusion. The simplest solution, however, is for reviewers and editors
to ask for a tad more transparency.

\hypertarget{data}{%
\section{Data}\label{data}}

To conduct this analysis, we replicate and reanalyze studies published
in the \emph{American Journal of Political Science} (AJPS). AJPS was one
of the earliest social science journals to adopt a firm data
transparency policy, enforcing the posting of replication data and code
for articles published beginning in 2013. Following our pre-analysis
plan,\footnote{See this link: \url{https://bit.ly/2luoV3E}. The plan
  specifies how we collected the data. We did not pre-specify the key
  analyses in this paper (Figure 3). We present the analyses we did
  specify in the supporting information and the results are consistent
  with our main findings.} we analyze articles from \emph{AJPS}
2013-2015 that focus mainly on establishing a single causal claim and
have a standard statistical model with at least one control variable
(see Table S1 in the supporting information for exclusion reasons). 64
of 163 articles in these years met these criteria. 49 of these are
observational and 15 are experimental. The topics of these studies range
widely, from the effects of judges having daughters on their rulings to
whether United Nations peacekeepers succeed in protecting civilians. On
average, these studies include 9 control variables in the fully
specified model.

We successfully reproduce the main findings in each of these articles,
including the coefficient estimates and the standard errors, as Figure 1
shows. Given the difficulties researchers have faced with
reproducibility \citep{king_replication_1995}, this result is
reassuring. In a handful of cases, the figure reveals that we estimate
different standard errors.

\begin{figure}
\centering
\includegraphics{covariates_paperpa_ca_files/figure-latex/unnamed-chunk-1-1.pdf}
\caption{Replication of Key Estimates and Standard Errors for Full
Specification Models}
\end{figure}

\hypertarget{frequency-and-disclosure-of-suppression-effects-in-observational-studies}{%
\section{Frequency and Disclosure of Suppression Effects in
Observational
Studies}\label{frequency-and-disclosure-of-suppression-effects-in-observational-studies}}

We first examine how often researchers use control variables to achieve
statistical significance through suppression effects in observational
studies (we examine experimental studies later). For each article, we
examine the effects of control variables on the estimated effect of the
key explanatory variable, the variable about which the article attempts
to establish a causal claim. To do so, we compare the p-value of the key
explanatory variable estimate in the full model (all control variables)
to a bivariate model. In cases where the research design or
identification strategy requires covariates (such as the main effects of
an interaction term, lagged dependent variables, and fixed effects), we
include these and count this as a bivariate specification. About half of
studies include some covariates in the ``bivariate specification,'' but
our main finding holds up when we look only at studies that did not
require them (see Figure 6 below and Figures S1, S2, and S4 in the
supporting information).

Given the nature of our analysis---examining the effects of whatever
control variables authors used---our definition of suppression effects
based on Cogner is expansive. Our definition includes all types of
suppression effects generated through any mechanism. Conger's definition
refers to a single suppressor but we are examining the effects of one or
more control variables---whatever the original authors used---and
therefore apply the concept to the general linear model originally
suggested by Holling \citeyearpar{holling_suppressor_1983}. Although our
analysis does not examine the types or mechanisms giving rise to
suppression, researchers' analysis should depend on the nature of the
suppression effect, a topic we return to later.

Since we are especially interested in how often findings depend on
undisclosed suppression effects, we examine the effect of controls
separately among articles that disclose or do not disclose a bivariate
estimate for their key effect. When researchers reveal a bivariate
estimate, readers can determine whether the key result depends on
suppression effects, justified or otherwise. To classify whether
researchers disclosed, we count studies as revealing the potential
presence of suppression effects if they report a bivariate estimate for
their key explanatory variable. We count this if they report it in a
statistical model or in some other form, such as a scatterplot,
cross-tab, or difference in means. We count these regardless of whether
they occur in figures, tables, the text, or in notes.

Figures 2 and 3 present the main finding of this paper.\footnote{Given
  the complexity of the findings we present, we show the same results
  with several alternative figure designs in the supporting information.
  See Figures S6 and S7a-S7d (we thank an anonymous reviewer for the
  suggestions).} Figure 2 begins by showing the effect of control
variables on the p-values of the key effect estimates and how often they
disclose this fact. The figure presents arrows that start at the p-value
of the bivariate specification and end at the p-value of the full
specification. When control variables lower p-values of the key
estimate, the arrows end pointing downwards. When control variables
increase p-values, the arrows end pointing upwards. The figure shows a
large number of downward arrows. When researchers include control
variables, their p-values drop a lot, many from well above 0.05 to below
0.05. In four cases, the introduction of controls switches the estimated
effect direction. To convey this switch, the figure shows the arrows
increasing from the bivariate specification p-value, to p=1, and then
back down to the p-value for the full specification.

These drops in p-values only represent suppression effects if they
result from increases in the absolute value of coefficients (as the
decreases could also result from smaller standard errors). Figure 3 adds
the additional information of how much of these p-value drops result
from coefficient changes. It presents the key result of this paper. To
show how much of these p-value drops results from coefficient changes,
Figure 3 alters the arrows so that the solid portion now shows the
changes in the p-value attributable only to coefficient changes, that
is, the change attributable to suppression effects. The dotted component
of the arrows shows the p-value changes attributable to only standard
error changes. As the solid portion of the arrows makes clear,
suppression effects drive much of the p-value decreases, though both
components contribute. Based on a careful examination of Figure 3,
suppression effects helped the authors lower their p-values towards 0.05
in about 20 of the 49 observational articles. Not all of these studies
achieved statistical significance at p\textless{}0.05 in the full
specification, but the decreases in p-values may have been necessary for
publication.

Figures 2 and 3 split the results into the left panel showing studies
that did not present a bivariate specification and the right panel
showing studies that did. Only a single one of the 20 studies revealed
the presence of suppression effects.

In sum, undisclosed suppression effects are common. We cannot know how
many of the 19 studies that did not disclose suppression would have been
published without the contribution from suppression effects, but it
seems possible that many or even all would not have. So, undisclosed
suppression effects may have contributed to the publication of as much
as (19/49 \(\approx\)) 40\% of observational studies. A conservative
estimate would be (15/49 \(\approx\)) 30\%. Another way of describing
this finding is that, when reading an observational article that does
not show a bivariate estimate in some form, readers should assume as
high as a (19/34 \(\approx\)) 55\% chance that it depends on suppression
effects for significance, a high probability.

\begin{figure}
\centering
\includegraphics{covariates_paperpa_ca_files/figure-latex/unnamed-chunk-3-1.pdf}
\caption{P-Value Changes in Observational Studies. Arrows for each study
show the total p-value changes from the bivariate to the full
specification (all controls). The figure shows that, when articles did
not present a bivariate specification, control variables lowered
p-values for the articles' key effect estimate. The four up and down
arrows reflect cases where the estimate changed sign from the bivariate
to the full specification. The next figure shows how much of the p-value
drops result from changes in coefficients from suppression effects
versus changes in standard errors.}
\end{figure}

\begin{figure}
\centering
\includegraphics{covariates_paperpa_ca_files/figure-latex/unnamed-chunk-4-1.pdf}
\caption{P-Value Changes in Observational Studies from Coefficient and
Standard Error Change. Arrows for each study show the total p-value
changes from the bivariate to the full specification (all controls). The
solid part of the arrows shows the p-value changes from only coefficient
estimate changes, while the dotted part shows the remaining p-value
changes from standard error changes. The figure shows that, when
articles failed to present a bivariate specification, they often depend
on suppression effects to achieve statistical significance. The four up
and down arrows reflect cases where the estimate changed sign from the
bivariate to the full specification. Figure 6 and Figures S1 and S2 in
the supporting information present robustness checks.}
\end{figure}

The effect of suppression on p-values is considerable, especially in
studies that do not reveal the presence of suppression effects. In
studies that are transparent by showing a bivariate, p-values actually
slightly increase (by 0.001). In studies that are not transparent,
p-values decrease by 0.33 on average. The difference in p-value changes
between those that do and do not is highly statistically significant
(p=0.0009, Wilcoxon rank-sum test). If we limit the p-value changes to
only that due to coefficient changes, non-transparent studies decrease
them by 0.29 on average and the difference in p-value changes remains
highly statistically significant (p=0.0027, Wilcoxon rank-sum test).

To show how control variables give rise to the p-value changes in
Figures 2 and 3, Figure 4 presents the log change in coefficients and
standard errors of key variables from the bivariate to the multivariate.
It uses log changes because they approximate percent changes (while
avoiding the symmetry and additivity problems of percent
changes).\footnote{To calculate the log change, we code the multivariate
  and bivariate estimates to positive. In the four cases where the signs
  flip between the bivariate and the multivariate, we rescale so that
  the bivariate estimate is zero and we add the bivariate to the
  multivariate. We then add one to all the estimates before taking the
  log. We also add one before calculating the log change in standard
  errors to keep the scale for coefficients and standard errors similar.
  Adding constants other than one doesn't substantively change the
  findings.} Each dot shows either the log change in coefficients or in
standard errors, and we sort dots by the p-value changes, as in Figures
2 and 3. Figure 4 makes clear that suppression effects occur in several
cases where they have no impact on the statistical significance of
studies. In such cases, suppression effects, even if not justified, are
less problematic since statistical significance does not depend on them.
The figure also makes clear that even small changes in effect sizes can
have large effects on statistical significance. Finally, this figure
illustrates the degree to which increased precision, a frequent
justification of the inclusion of control variables, isn't terribly
common.

\begin{figure}
\centering
\includegraphics{covariates_paperpa_ca_files/figure-latex/unnamed-chunk-5-1.pdf}
\caption{Key Coefficient and Standard Error Change in Observational
Studies. Each dot shows either the log change in coefficients (top) and
standard errors (bottom). The dots are sorted by the p-value changes,
the exact same order as in Figures 2 and 3.}
\end{figure}

How many of the undisclosed cases of suppression represent false
positives where? That is, how many of these are cases where researchers
are relying on suppression effects, intentionally or not, to publish
nonexistent effects? This is a hard question. We spent considerable time
reading and reanalyzing replication data for these articles, trying to
determine whether suppression effects could be justified, that is,
whether researchers could have provided a good explanation for their key
effect estimates being suppressed.

In some cases, we think they do. One example is Davenport
\citeyearpar{davenport_policyinduced_2015}, who examines the effect of
casualties and low draft numbers on parents' turnout during the Vietnam
War. She finds that parents whose children are at high risk of being
drafted (low lottery numbers) and who live in towns that have
casualties, are more likely to turn out. One can tell a simple story
about suppressor variables: poor regions of the country have lower
turnout and disproportionately contribute soldiers to combat roles in
Vietnam, so are hit disproportionately by casualties. Socio-economic
status may therefore suppress Davenport's key interaction estimate.
Indeed, in reanalyzing Davenport's data, we find that prior turnout
(which likely captures individual socio-economic status) and town
measures of socio-economic status increase the size of her key
interaction estimate, lowering its p-value from 0.11 in the bivariate
specification to 0.01 in the full specification. So, Davenport can
provide a good explanation for why her finding depends on suppression
effects.

For many articles, however, we cannot find an obvious justification for
suppression effects, though it is possible that authors could provide
one. In several of these articles, unusual control variables were the
key suppressors. These include variables that arguably should not be in
the models, such as post-treatment variables, and variables that change
the interpretation of the finding in a way that seems unintended by the
authors, such as including a lagged dependent variable when the article
is not about explaining change. These potentially questionable control
variables would have been harder to miss in the review process if
authors had alerted readers to the presence of suppression effects.

\hypertarget{frequency-and-disclosure-of-suppression-effects-in-experimental-studies}{%
\section{Frequency and Disclosure of Suppression Effects in Experimental
Studies}\label{frequency-and-disclosure-of-suppression-effects-in-experimental-studies}}

Suppression effects should be less common in experimental studies that
randomly assign treatment in large samples. In observational studies,
controls can change the key variable's estimate because they can
correlate with the dependent variable and key independent variable. When
researchers randomly assign the treatment in large samples, by contrast,
the correlation with controls is minimal, especially as the sample size
increases. Therefore, we would expect minimal sensitivity to control
variable choice. All the experiments we examine here used random
assignment and have relatively large Ns; the median N is 868 and
smallest is 156. Consistent with this expectation, Figure 5 reveals
little sign of p-value decreases in experimental studies with controls.
Randomized experiments have many advantages---one is less vulnerability
to discretion in control variable choice.

\begin{figure}
\centering
\includegraphics{covariates_paperpa_ca_files/figure-latex/unnamed-chunk-7-1.pdf}
\caption{P-Value Changes in Experimental Studies. For each article, the
arrows show the total p-value changes from the bivariate to the full
specification. The solid part of the arrows shows the p-value changes
from only coefficient estimate changes, while the dotted part shows
p-value changes from standard error changes.}
\end{figure}

\hypertarget{robustness-and-potential-objections}{%
\section{Robustness and Potential
Objections}\label{robustness-and-potential-objections}}

The main finding of this paper, shown in Figures 2 and 3, is robust:
when researchers don't disclose a bivariate, adding control variables
drops p-values consistently across various subsets of the data, as shown
in Figure 6. P-values drop when researchers use fixed effects or don't
use fixed effects and when we exclude the cases where the sign of the
effects flip. They drop in each of the three volumes of the \emph{AJPS},
in each of the three empirical subfields (American, comparative, and
international relations), and when the bivariate/minimal specification
has no covariates versus when we must include covariates for the minimal
specification to make sense. Finally, they also drop in studies with
less than the median number of controls, which is eight, and in studies
with more than the median number of controls (see also Figures S1 and S2
in the supporting information).

One potential alternative explanation for our key finding is that
researchers may not show the bivariate specification when controls favor
them because they are working on topics where suppressor variables are
well known. To address this possibility and to examine whether
researchers justified suppression effects, we carefully read the
sections on controls in these articles but found no acknowledgment or
justification. We present each articles' text relating to controls in
the supporting information (for articles that did not present a
bivariate specification).

Another account for our key finding comes from the incentives of the
journal review process. Researchers may believe that, if they disclose
suppression effects, reviewers will get hung up on them. The pattern we
observe---only one published study relies on suppression effects for
significance and discloses this fact---is consistent with this concern.
If this is authors' motivation, however, it points to a hole in the
review process, as reviewers should be assessing the plausibility of
control-induced increases in effect sizes.

\begin{figure}
\centering
\includegraphics{covariates_paperpa_ca_files/figure-latex/unnamed-chunk-8-1.pdf}
\caption{P-value Changes in Studies that Do Not Show Bivariate. This
plot shows the average change in p-values from the bivariate to the full
specification (with 95\% confidence intervals) for observational and
experimental studies that did not show a bivariate specification. The
mean number of controls in these studies is nine and the median is
eight. Figures S1 and S2 in the supporting information present similar
plots for all studies and for p-value changes only from coefficient
changes.}
\end{figure}

Some scholars have argued that showing multiple specifications in
regression tables, as researchers often do, is enough, especially when
estimates generally seem stable across the multiple specifications. We
believe, however, that researchers may underappreciate the degree to
which estimates vary across combinations of controls not shown.
Researchers often present a table of several regression specifications,
sequentially adding controls, but these represent a handful of thousands
of possible specifications. We show the range of effects researchers
could have produced in Figure 7, which presents the distribution of
t-statistics across all possible control combinations (randomly sampled
in cases with many control) for the observational studies. We show
t-statistics to scale the coefficients across these studies. As before,
the figure breaks the studies into those that show a bivariate
specification and those that do not. It also shows the t-statistic for
the bivariate specification (b) and the full multivariate specification
(m). The figure shows a large range of estimates in many of the studies
across control combinations, with estimates often crossing the threshold
for statistical significance and sometimes producing effects of opposite
signs. This figure understates the potential for discretion since it
only reflects the controls the authors chose to condition on. And, of
course, control variable choice is just one of several sources of
modeling discretion.

Another objection we have encountered is that readers do not need
bivariate specifications to assess the effects of control variable
discretion because they can merely look at the control variables
themselves. They can assess, for instance, whether researchers have left
out controls that would work against their key finding, suppressing its
effect. Undoubtedly, readers can do this in some cases. In our
reanalysis of these articles, however, some of the controls responsible
for increasing estimated effect sizes are unexpected, and we doubt
experts would have anticipated them.

Another objection is that researchers may be unable to justify the
effects of controls on their estimates because they are too complicated
to explain in a multidimensional context. Although this point has merit,
we think that researchers will often have priors about the
reasonableness of the net effect of controls. We also find that control
variables largely have similar effects on their own when conditioning on
other control variables (see Figure S3 in the supporting information).
More importantly, if researchers cannot be required to provide
justifications, then discretion will be left partly unchecked.

Finally, it's worth emphasizing that the p-value drops we observe from
adding controls do not generally result from large changes in the key
coefficient estimates. Although scaling changes across studies is
difficult, the key coefficients increase on average by only about 20\%,
calculated with log +1 changes, but those increases translate into large
p-value decreases (see Figures 4 and S4-S5 in the supporting
information). Researchers, therefore, are benefiting from suppression,
but relatively weak suppression, enough to shift their p-values below
0.05. Consistent with this pattern, multicollinearity does not seem
especially high in these studies, nor does it vary with p-value changes.
In the supplemental information, we present plots of the average
correlation between the key variable and control variables for all
studies (see supporting information Figure S8).

\begin{figure}
\centering
\includegraphics{covariates_paperpa_ca_files/figure-latex/unnamed-chunk-9-1.pdf}
\caption{Distribution of t-Statistics across All Possible Control
Variable Combinations for Observational Studies (randomly sampled in
cases with more than 10,000 combinations, sorted by variance). `m' shows
the full multivariate t-statistic and `b' the bivariate t-statistic. The
grey dotted lines mark t-statistics at -1.96 and 1.96, the p = 0.05
thresholds for conventional statistical significance. We exclude two
studies here because of the inordinate time required to estimate each
specification.}
\end{figure}

\hypertarget{discussion-and-conclusion}{%
\section{Discussion and Conclusion}\label{discussion-and-conclusion}}

Based on a reanalysis of political science studies, we find that
statistical significance depends on suppression effects in a large
fraction of observational studies. Almost none of these studies are
transparent about this fact. By increasing the estimates of key effect
sizes, undisclosed suppression effects may have contributed to the
publication of over 30\% of observational studies. When reading an
observational article that does not show a bivariate estimate in some
form, readers should assume nearly a 60\% chance that the article's key
effect depends on suppression for statistical significance. Suppression
effects in these articles may be well justified, but authors do not
acknowledge, let alone justify them. Without disclosure, readers,
reviewers, and editors cannot subject them to scrutiny, asking whether
the authors have a good justification for suppression.

It is important to be clear about the argument of this paper. We are in
no way claiming that suppression effects necessarily introduce bias.
They can be an important part of observational research, capturing data
generating processes whose outcomes would otherwise go unnoticed.
Instead, we are arguing that, because suppression effects can introduce
bias that favors authors, and do so in a nontransparent way, readers
typically want to be aware of them. We are also in no way claiming that
these 30-40\% of articles that rely on suppression effects to achieve
statistical significance are false positives. Instead, we are simply
pointing out that editors and reviewers would likely have scrutinized
the control variables more closely if they knew these articles depended
on suppression effects for their statistical significance. That
additional scrutiny may or may not change the publication outcomes for
these articles. We are also in no way arguing that bivariate estimates
are less biased or preferable. Instead, we are arguing that requiring
disclosure of bivariates alerts readers to suppression effects and the
extent of those effects. Readers want to know about their presence, we
think, because they mean that estimates are, by definition, not robust
to model specification, more vulnerable to hidden discretion by authors
in their choice of controls, and potentially produced by opaque
mechanisms such as amplification bias. In contrast, control variables
that reduce the magnitude of key estimates can also introduce bias, but
this bias works against authors.

The degree to which we find studies depending on suppression effects
highlights the problems with research oriented around p-value thresholds
\citep{mcshane_abandon_2019}. When researchers must report findings with
p-values below a certain threshold to publish, incentives for them to
``find a way'' become strong. This leads to numerous perverse
incentives, including model selection based on p-values.

As a field, we can limit discretion. Authors can assess robustness to a
wider array of control variable choices with Bayesian Model Averaging
\citep{bartels_specification_1997, montgomery_bayesian_2010, leamer_svalues_2016}
and related techniques. They can also use specification-curve analysis
in which they report their key effect estimate across all theoretically
justifiable model specifications \citep{simonsohn_specification_2015}.
Researchers can also limit the effects of model discretion by matching
before they analyze their data
\citep{hainmueller_entropy_2012, sekhon_multivariate_2011, imai_covariate_2014},
reducing model extrapolation \citep{ho_matching_2007a}. They could use
hold out samples and apply newly developed estimators for cases where
the number of control variables is large
\citep{athey_efficient_2016, ning_high_2017}. They can also preregister
control variables before data collection as part of a pre-analysis plan
\citep{casey_reshaping_2012, humphreys_fishing_2013}.

Most simply, researchers could disclose the bivariate specification to
allow reviewers and readers to assess the effects of adding control
variables.\footnote{Presenting bivariate relationships is important for
  yet another reason: they may help researchers infer the level of
  confounding
  \citep{RSSB:RSSB12167, oster_unobservable_2016, altonji_evaluation_2005}.}
If the bivariate specification departs noticeably from other
specifications, authors need to explain why. This is difficult in cases
with more than a handful of controls \citep{achen_new_2002a}, though see
Gelbach \citeyearpar{gelbach_when_2016} for a formalization. By
disclosing the bivariate, readers can assess whether control variables
could be introducing bias that favors the authors. Given the high rate
of false positives in published research, readers should know this basic
fact.

Besides limiting the potential for discretion, researchers should also
be attuned to the source of the suppression effects. In some cases,
researchers have strong justifications for them, such as in the Tesler
and Sears example \citeyearpar{tesler_obama_2010} and the Davenport
\citeyearpar{davenport_policyinduced_2015} examples discussed earlier.
In other cases, however, researchers should consider excluding the
suppressor variables, such as when the source is classical suppression,
mediation, or amplification bias.

\hypertarget{authors-note}{%
\section{Authors' Note}\label{authors-note}}

Funding for this research was provided by the Berkeley Initiative for
Transparency in the Social Sciences (BITSS), a program of the Center for
Effective Global Action (CEGA), with support from the Laura and John
Arnold Foundation. For helpful comments, we thank participants at the
Midwest Political Science Association 2017 Meeting, the Berkeley
Initiative for Transparency in the Social Sciences (BITSS) 2016 Annual
Meeting, the Berkeley Research Workshop in American Politics, and
Jasjeet Sekhon's methods seminar. We also thank Larry Bartels, Martin
Bisgaard, Allan Dafoe, Vicky Fouka, Donald Green, Andy Hall, Jens
Hainmueller, Daniel Hopkins, Macartan Humphreys, Greg Huber, Kosuke
Imai, Josh Kalla, Martin Vinaes Larsen, Yotam Margalit, John Bullock,
Seth Hill, John Marshall, Andrew McCall, David Nield, Mark Pickup,
Jasjeet Sekhon, Laura Stoker, and the anonymous reviewers. Replication
materials are available from Lenz and Sahn \citeyearpar{data2020}.

\bibliography{covariates.bib}


\end{document}
